Abstract

Inferences that arise from loss functions determined by the prior are 
considered and it is shown that these lead to limiting Bayes rules that 
are closely connected with likelihood. The procedures obtained via these 
loss functions are invariant under reparameterizations and are Bayesian 
unbiased or limits of Bayesian unbiased inferences. These inferences serve 
as well-supported alternatives to MAP-based inferences. 

Key words and phrases: loss functions, relative surprise, lowest posterior risk 
region, Bayesian unbiasedness. 

1 Introduction 

Suppose we have a sampling model, given by a collection of densities $\{f_\theta : \theta \in \Theta\}$
with respect to a support measure $\mu$ on sample space $\mathcal{X}$, and a proper
prior, given by density $\pi$ with respect to support measure $\nu$ on $\Theta$. When we
observe data $x$ these ingredients lead to the posterior on $\Theta$ with density given
by $\pi(\theta \mid x) = \pi(\theta) f_\theta(x)/m(x)$ with respect to support measure $\nu$ where
$m(x) = \int_\Theta \pi(\theta) f_\theta(x)\,\nu(d\theta)$.

One can determine inferences based on these ingredients alone. For example, 
suppose we are interested in a characteristic $\psi = \Psi(\theta)$ where $\Psi : \Theta \to \Psi$ and
we let $\Psi$ stand for both the space and mapping to conserve notation. The highest
posterior density (hpd), or MAP-based, approach to determining inferences
constructs credible regions of the form

$$H_\gamma(x) = \{\psi_0 : \pi_\Psi(\psi_0 \mid x) \ge h_\gamma(x)\} \qquad (1)$$

where $\pi_\Psi(\cdot \mid x)$ is the marginal posterior density with respect to a support
measure $\nu_\Psi$ on $\Psi$, and $h_\gamma(x)$ is chosen so that
$h_\gamma(x) = \sup\{k : \Pi_\Psi(\{\psi : \pi_\Psi(\psi \mid x) \ge k\} \mid x) \ge \gamma\}$. It follows from (1) that, if we want to assess the
hypothesis $H_0 : \Psi(\theta) = \psi_0$, then we can use the tail probability given by
$1 - \inf\{\gamma : \psi_0 \in H_\gamma(x)\}$. Furthermore, the class of sets $H_\gamma(x)$ is naturally
"centered" at the posterior mode (when it exists uniquely) as $H_\gamma(x)$ converges to
this point as $\gamma \to 0$. The use of the posterior mode as an estimator is commonly
referred to as MAP (maximum a posteriori) estimation. We can then think of
the size of the set $H_\gamma(x)$, say for $\gamma = 0.95$, as a measure of how accurate the
MAP estimator is in a given context. Furthermore, we have that when $\Psi$ is an
open subset of a Euclidean space, then $H_\gamma(x)$ minimizes volume among all
$\gamma$-credible regions. The use of MAP-based inferences is very common in machine
learning contexts, see, for example, Bishop (2006). 

It is well-known, however, that hpd inferences suffer from a serious defect. In 
particular, in the continuous case hpd inferences are not invariant under
reparameterizations. For example, this means that if $\psi_{MAP}(x)$ is the MAP estimate
of $\psi$, then it is not necessarily true that $\Upsilon(\psi_{MAP}(x))$ is the MAP estimate of
$\tau = \Upsilon(\psi)$ when $\Upsilon$ is a 1-1, smooth transformation. The noninvariance of a
statistical procedure seems very unnatural as it implies that the statistical analysis
depends on the parameterization and typically there does not seem to be a good 
reason for this. 
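
The lack of invariance can be seen in a small numerical sketch. The setting below is an assumption made purely for illustration and is not taken from the paper: the posterior of $\theta$ is taken to be Beta(3, 9) and the reparameterization is $\tau = \log(\theta/(1-\theta))$. The helper names are hypothetical. The code locates both posterior modes on a grid and shows that transforming the mode of $\theta$ does not give the mode of $\tau$.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def sigmoid(t):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-t))

def argmax_on_grid(f, lo, hi, steps=20000):
    """Return the interior grid point of (lo, hi) at which f is largest."""
    best_x, best_val = None, -math.inf
    for i in range(1, steps):
        x = lo + (hi - lo) * i / steps
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x

a, b = 3.0, 9.0  # hypothetical Beta(a, b) posterior for theta

def log_post_theta(theta):
    # Log posterior density of theta, up to an additive constant.
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)

def log_post_tau(tau):
    # Log posterior density of tau = logit(theta); the Jacobian
    # theta * (1 - theta) raises each exponent by one.
    theta = sigmoid(tau)
    return a * math.log(theta) + b * math.log(1.0 - theta)

theta_map = argmax_on_grid(log_post_theta, 0.0, 1.0)   # near (a-1)/(a+b-2)
tau_map = argmax_on_grid(log_post_tau, -10.0, 10.0)    # near logit(a/(a+b))

print("logit of the MAP estimate of theta:", logit(theta_map))
print("MAP estimate of tau               :", tau_map)
```

The two printed values differ, so the MAP estimate depends on which parameterization is used to write down the density.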

A class of inferences, similar to hpd inferences, avoids this lack of invariance. 
These are referred to as relative surprise inferences and are based on the regions 

$$C_\gamma(x) = \{\psi : \pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \ge c_\gamma(x)\} \qquad (2)$$

where $\pi_\Psi$ is the marginal prior density with respect to a support measure $\nu_\Psi$ on
$\Psi$, and $c_\gamma(x) = \sup\{k : \Pi_\Psi(\{\psi : \pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \ge k\} \mid x) \ge \gamma\}$. The hypothesis
$H_0 : \Psi(\theta) = \psi_0$ is assessed by computing the tail probability

$$1 - \inf\{\gamma : \psi_0 \in C_\gamma(x)\} = \Pi_\Psi\big(\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) < \pi_\Psi(\psi_0 \mid x)/\pi_\Psi(\psi_0) \mid x\big). \qquad (3)$$

We refer to $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ as the relative belief ratio of $\psi$ as it measures how
beliefs in $\psi$ being the true value change from a priori to a posteriori. The
relative surprise terminology then comes from (3) as this is measuring how
surprising the value $\psi_0$ is by comparing its relative belief ratio to the relative
belief ratios of other values of $\psi$. The corresponding estimator is given by the
maximizer of the ratio $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$, which we refer to as the least relative
surprise estimator (LRSE), and denote as $\psi_{LRSE}(x)$. Note that $\psi_{LRSE}(x)$ is
the least surprising value as it maximizes (3). Beyond their invariance these
inferences have many optimality properties in the class of all Bayesian inferences
as documented in Evans (1997), Evans, Guttman and Swartz (2006), Evans and
Shakhatreh (2008) and Jang (2010). In this paper we will establish optimal
decision-theoretic properties for relative surprise inferences.
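
To make the relative belief ratio concrete, the following sketch works through a finite example. Everything in it is a hypothetical choice made for illustration: the nine-point parameter grid, the geometric prior weights, and the binomial data (6 successes in 10 trials). It computes the posterior, the relative belief ratios, the MAP and LRSE values, and the tail probability in (3).

```python
import math

thetas = [i / 10 for i in range(1, 10)]          # hypothetical finite parameter space
weights = [0.5 ** i for i in range(9)]           # prior weights favouring small theta
prior = [w / sum(weights) for w in weights]

n, s = 10, 6                                     # hypothetical data: 6 successes in 10 trials
lik = [math.comb(n, s) * t ** s * (1 - t) ** (n - s) for t in thetas]

m = sum(p * l for p, l in zip(prior, lik))       # prior predictive m(x)
post = [p * l / m for p, l in zip(prior, lik)]   # posterior probabilities
rb = [q / p for q, p in zip(post, prior)]        # relative belief ratios

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

i_map, i_lrse, i_mle = argmax(post), argmax(rb), argmax(lik)

def tail_probability(i0):
    """Posterior probability of values whose relative belief ratio
    is strictly below that of thetas[i0], as in (3)."""
    return sum(q for q, r in zip(post, rb) if r < rb[i0])

print("MAP :", thetas[i_map])
print("LRSE:", thetas[i_lrse], " MLE:", thetas[i_mle])
print("tail probability at the LRSE:", tail_probability(i_lrse))
```

In this example the prior pulls the MAP value below the MLE, while the LRSE coincides with the MLE, since for the full parameter the relative belief ratio is the likelihood divided by $m(x)$.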

The idea of measuring surprise based on how beliefs change from a priori
to a posteriori, and using this for inference, has arisen in other discussions. For
example, see Baldi and Itti (2010) for the use and development of this idea in
the context of learning.

While hpd and relative surprise inferences may seem quite natural, another
ingredient is often added to the formulation of a statistical problem,
namely, a loss function. For this we have an action space $\Psi$, a function
$\Psi : \Theta \to \Psi$, such that $\Psi(\theta)$ is the correct action when $\theta$ is true, and a loss
function $L : \Theta \times \Psi \to [0, \infty)$ satisfying $L(\theta, \Psi(\theta)) = 0$, i.e., there is no loss
when we take the correct action. The goal of a statistical decision analysis
is then to find a decision function $\delta : \mathcal{X} \to \Psi$ that minimizes the prior
risk $r(\delta) = \int_\Theta \int_{\mathcal{X}} L(\theta, \delta(x)) f_\theta(x) \pi(\theta)\,\mu(dx)\,\nu(d\theta) = \int_{\mathcal{X}} r(\delta \mid x) m(x)\,\mu(dx)$ where
$r(\delta \mid x) = \int_\Theta L(\theta, \delta(x)) \pi(\theta \mid x)\,\nu(d\theta)$ is the posterior risk. Such a $\delta$ is called a
Bayes rule and clearly a $\delta$ that minimizes $r(\delta \mid x)$ for each $x$ is a Bayes rule.
Further discussion of decision theory can be found in Berger (1985). 

As noted in Bernardo (2005) a decision formulation also leads to credible
regions for $\psi$, namely, a $\gamma$-lowest posterior loss credible region is defined by

$$L_\gamma(x) = \{\psi : r(\psi \mid x) \le l_\gamma(x)\} \qquad (4)$$

where $l_\gamma(x) = \inf\{k : \int_{\{\psi : r(\psi \mid x) \le k\}} \pi_\Psi(\psi \mid x)\,\nu_\Psi(d\psi) \ge \gamma\}$. Note that $\psi$ in (4)
is interpreted as the decision function that takes the value $\psi$ constantly in $x$.
Clearly as $\gamma \to 0$ the set $L_\gamma(x)$ converges to the value of a Bayes rule at $x$.
For example, with quadratic loss the Bayes rule is given by the posterior mean
and a $\gamma$-lowest posterior loss region is the smallest sphere centered at the mean
containing at least $\gamma$ of the posterior probability.

Typically, in the continuous context, Bayes rules will not be invariant under
reparameterizations. Robert (1996) recommended using the intrinsic loss
function based on a measure of distance between sampling distributions as
Bayes rules with respect to such losses are invariant. Bernardo (2005)
recommended using the intrinsic loss function based on the Kullback-Leibler
divergence $KL(f_\theta, f_{\theta'})$ between $f_\theta$ and $f_{\theta'}$. When $\psi = \theta$ the intrinsic loss function
is given by $L(\theta, \theta') = \min(KL(f_\theta, f_{\theta'}), KL(f_{\theta'}, f_\theta))$. For a general marginal
parameter $\psi$ the intrinsic loss function is defined by $L(\theta, \psi) = \inf_{\theta' \in \Psi^{-1}\{\psi\}} L(\theta, \theta')$.

It can be shown, for example see Bernardo and Smith (2000) and Section 
4, that hpd inferences arise as the limits of Bayes rules via a sequence of loss 
functions given by 

$$L_\lambda(\theta, \psi) = I(\Psi(\theta) \notin B_\lambda(\psi)) \qquad (5)$$

where $\lambda > 0$ and $B_\lambda(\psi)$ is the ball of radius $\lambda$ centered at $\psi$. As previously
noted these inferences are not invariant under reparameterizations. It is our
purpose here to show that relative surprise inferences also arise via a sequence
of loss functions similar to (5) but based on the prior. So the loss functions are
also in a sense intrinsic but based on the prior and not the sampling model, as 
with the intrinsic loss function. 

In Section 2 we develop the prior-based loss function and show that $\psi_{LRSE}$
is a Bayes rule when $\Psi$ is finite. In Sections 3 and 4 we extend this result to
show that $\psi_{LRSE}$ is generally a limit of Bayes rules. In Section 5 we discuss
prediction problems and in Section 6 show that relative surprise regions are
limits of $\gamma$-lowest posterior loss credible regions.

It is easy to see that the class of relative surprise credible regions
$\{C_\gamma(x) : \gamma \in [0, 1]\}$ for $\psi$ is independent of the marginal prior $\pi_\Psi$. We note, however, that
when we specify a $\gamma \in [0, 1]$, the set $C_\gamma(x)$ does depend on $\pi_\Psi$ through $c_\gamma(x)$.
So the form of relative surprise inferences about $\psi$ is completely robust to the
choice of $\pi_\Psi$, but the quantification of the uncertainty in the inferences is not.
For example, when $\psi = \Psi(\theta) = \theta$, then $\theta_{LRSE}(x)$ is the MLE while, in general,
$\psi_{LRSE}(x)$ is the maximizer of the integrated likelihood where we have integrated
out nuisance parameters via the conditional prior given $\psi$. Similarly, relative
surprise regions are likelihood regions in the case of the full parameter, and
integrated likelihood regions generally. As such, the results derived in this paper
establish that likelihood inferences are essentially Bayesian in character. We
note, however, that a relative belief ratio $\pi_\Psi(\psi_0 \mid x)/\pi_\Psi(\psi_0)$, while proportional
to an integrated likelihood, has an interpretation as a change in belief and cannot
be multiplied by an arbitrary positive constant, as with a likelihood, without
losing this interpretation.

In Le Cam (1953) it is shown that the MLE is asymptotically Bayes but this 
is for a fixed loss function, with increasing amounts of data and a sequence of 
priors. In this paper the amount of data and the prior are fixed but we may 
require a sequence of loss functions, to show that the MLE is a limit of Bayes 
rules. Berger, Liseo and Wolpert (1999) discuss maximum integrated likelihood 
estimates where default or noninformative priors are used to integrate out nui- 
sance parameters and show good properties for this approach. Aitkin (2010) 
develops an approach to assessing hypotheses using the posterior distribution 
of likelihood ratios that is based on earlier work by Dempster (1973). As that 
approach does not use integrated likelihoods and, as of this time, doesn't have 
a decision-theoretic formulation, it is quite different than what we discuss here. 

2 Estimation from Prior-based Loss Functions: 
The Finite Case 

The following theorem presents the basic definition of the loss function when
$\Psi$ is finite and establishes an important optimality result. For more general
situations we will need to modify this loss function slightly.

Theorem 1. Suppose that $\pi_\Psi(\psi) > 0$ for every $\psi \in \Psi$ and that $\Psi$ is finite with
$\nu_\Psi$ equal to counting measure. Then for the loss function

$$L(\theta, \psi) = \frac{I(\Psi(\theta) \ne \psi)}{\pi_\Psi(\Psi(\theta))} \qquad (6)$$

a Bayes rule is given by $\psi_{LRSE}$.
Proof: We have that

$$r(\delta \mid x) = \int_\Theta \frac{I(\Psi(\theta) \ne \delta(x))}{\pi_\Psi(\Psi(\theta))}\,\pi(\theta \mid x)\,\nu(d\theta) = \int_\Psi \frac{I(\psi \ne \delta(x))}{\pi_\Psi(\psi)}\,\pi_\Psi(\psi \mid x)\,\nu_\Psi(d\psi) = \sum_{\psi \in \Psi} \frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)} - \frac{\pi_\Psi(\delta(x) \mid x)}{\pi_\Psi(\delta(x))}. \qquad (7)$$

Since $\Psi$ is finite, the first term in (7) is finite and a Bayes rule at $x$ is given by
the value $\delta(x)$ that maximizes the second term. Therefore, $\psi_{LRSE}$ is a Bayes
rule.

From (7) the prior risk of $\delta$ is

$$r(\delta) = \#(\Psi) - E_M\big(\pi_\Psi(\delta(x) \mid x)/\pi_\Psi(\delta(x))\big) = \sum_{\psi \in \Psi} M_\psi(\delta(x) \ne \psi) \qquad (8)$$

where $E_M$ denotes expectation with respect to the prior predictive and $M_\psi$ is
the probability measure on $\mathcal{X}$ obtained by averaging $P_\theta$ using the conditional
prior given that $\Psi(\theta) = \psi$, namely, $M_\psi(A) = \int_{\Psi^{-1}\{\psi\}} P_\theta(A)\,\Pi(d\theta \mid \Psi(\theta) = \psi)$.
Therefore, finding a Bayes rule with respect to (6) is equivalent to finding $\delta$
that maximizes $E_M(\pi_\Psi(\delta(x) \mid x)/\pi_\Psi(\delta(x)))$. So a Bayes rule maximizes the prior
expected relative belief ratio evaluated at the estimate and it is clear that the
LRSE is a Bayes rule as it maximizes the relative belief ratio for each $x$.
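
The optimality argument can be checked numerically in a small case. The sketch below assumes a hypothetical three-point parameter space with $\Psi(\theta) = \theta$, a made-up prior, and made-up likelihood values at the observed $x$. It evaluates the posterior risk directly from the loss definition, both for the prior-weighted loss $I(\Psi(\theta) \ne \psi)/\pi_\Psi(\Psi(\theta))$ and for the plain 0-1 loss, and then finds the minimizing action in each case.

```python
prior = [0.7, 0.2, 0.1]   # hypothetical prior over three parameter values
lik = [0.2, 0.5, 0.9]     # hypothetical likelihood f_theta(x) at the observed x

m = sum(p * l for p, l in zip(prior, lik))
post = [p * l / m for p, l in zip(prior, lik)]
rb = [q / p for q, p in zip(post, prior)]       # relative belief ratios

def loss_prior_weighted(true, action):
    """Zero for the correct action, reciprocal prior probability otherwise."""
    return 0.0 if true == action else 1.0 / prior[true]

def loss_zero_one(true, action):
    """Zero for the correct action, one otherwise."""
    return 0.0 if true == action else 1.0

def posterior_risk(action, loss):
    """Expected loss of a fixed action under the posterior."""
    return sum(loss(true, action) * post[true] for true in range(len(prior)))

actions = range(len(prior))
bayes_weighted = min(actions, key=lambda a: posterior_risk(a, loss_prior_weighted))
bayes_zero_one = min(actions, key=lambda a: posterior_risk(a, loss_zero_one))

print("Bayes action, prior-weighted loss:", bayes_weighted)
print("Bayes action, 0-1 loss           :", bayes_zero_one)
```

Under these assumed numbers the prior-weighted loss selects the value with the largest relative belief ratio, while the 0-1 loss selects the posterior mode, and the two differ.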

If instead we take the loss function to be $I(\Psi(\theta) \ne \psi)$, then virtually the
same proof establishes that $\psi_{MAP}$ is a Bayes rule. The prior risk for this loss
function and estimator $\delta$ can be written as

$$r(\delta) = \sum_{\psi \in \Psi} \pi_\Psi(\psi)\,M_\psi(\delta(x) \ne \psi) \qquad (9)$$

which is the prior probability of making an error. Both $I(\Psi(\theta) \ne \psi)$ and (6)
are two-valued loss functions but, when we make an incorrect decision, the loss
is constant in $\Psi(\theta)$ for $I(\Psi(\theta) \ne \psi)$ while it equals the reciprocal of the prior
probability of $\Psi(\theta)$ for (6). So (6) penalizes an incorrect decision much more
severely when the true value of $\Psi(\theta)$ is in the tails of the prior. This makes
sense as we would want to override the effect of the prior when the prior is not
placing appreciable mass at the true value. Note that $\psi_{MAP} = \psi_{LRSE}$ when $\Pi_\Psi$
is uniform.

As we have already noted $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ is proportional to the integrated
likelihood of $\psi$ when we integrate the likelihood with respect to the conditional
prior of $\theta$ given $\psi$. So, under the conditions of Theorem 1, we have shown that
the maximum integrated likelihood estimator is a Bayes rule. Furthermore, the
Bayes rule is the same for every choice of $\pi_\Psi$ and only depends on the full prior
through the conditional prior placed on the nuisance parameters. When $\psi = \theta$
then $\psi_{LRSE}(x)$ is the MLE of $\theta$ and so the MLE of $\theta$ is a Bayes rule for every
prior $\pi$.

We consider an application.

Example 1. Classification

For a classification problem we have $k$ categories $\{\psi_1, \ldots, \psi_k\}$ prescribed by
some function $\Psi$ where $\pi_\Psi(\psi_i) > 0$ for each $i$. Based on observed data $x$ we
want to classify the data as having come from one of the distributions in the
classes specified by $\Psi^{-1}\{\psi_i\}$.

The standard Bayesian solution to this problem is to use $\psi_{MAP}(x)$ as the
classifier. From (9) we have that $\psi_{MAP}(x)$ minimizes the prior probability of
misclassification. Note that $M_\psi(\delta(x) \ne \psi)$ is the prior probability of a
misclassification given that $\psi$ is the correct class and (9) is the weighted average
of these probabilities where the weights are given by the prior probabilities of
the $\psi$. We see from (8) that $\psi_{LRSE}(x)$ is instead minimizing the sum over $\psi$ of
the probabilities of misclassification given that $\psi$ is the correct class. So the
essence of the difference between these two approaches in this problem is that
$\psi_{LRSE}(x)$ treats the errors of misclassification equally while $\psi_{MAP}(x)$ weights
them by their prior probabilities of occurrence.

We note that (8) is an upper bound on (9). So if the Bayes risk for loss
function (6) is small, the prior risk of $\psi_{LRSE}(x)$, with respect to the loss function
$I(\Psi(\theta) \ne \psi)$, is also small, i.e., when using $\psi_{LRSE}(x)$ the overall prior probability
of a misclassification will also be small.

In general, it seems appropriate to be concerned with minimizing each of
the probabilities $M_\psi(\delta(x) \ne \psi)$ and not downweight those corresponding to
values that have small prior probability. As a specific simple example suppose
$k = 2$ and $x \sim$ Binomial$(1, \psi_1)$ or $x \sim$ Binomial$(1, \psi_2)$ with $\pi(\psi_1) = 1 - \epsilon$ and
$\pi(\psi_2) = \epsilon$. After observing $x$ we want to classify the observation. For example,
$\psi_i$ could be the probability of a diagnostic test for a disease indicating that the
disease is present. We suppose that $\psi_1$ is the probability of a positive diagnostic
test for the nondiseased population while $\psi_2$ is this probability for the diseased
population. Further suppose that $\psi_1/\psi_2$ is very small, indicating that the test
is successful in identifying the disease while not yielding many false positives,
and suppose $\epsilon$ is very small, indicating that the disease is very rare. We have
that $\pi(\psi_1 \mid 1) = \psi_1(1 - \epsilon)/(\psi_1(1 - \epsilon) + \psi_2\epsilon)$ and
$\pi(\psi_1 \mid 0) = (1 - \psi_1)(1 - \epsilon)/((1 - \psi_1)(1 - \epsilon) + (1 - \psi_2)\epsilon)$. Therefore, $\psi_{MAP}(1) = \psi_1$ if $\psi_1/\psi_2 > \epsilon/(1 - \epsilon)$ and
is $\psi_2$ otherwise, while $\psi_{MAP}(0) = \psi_1$ if $(1 - \psi_1)/(1 - \psi_2) > \epsilon/(1 - \epsilon)$ and
is $\psi_2$ otherwise. Also $\psi_{LRSE}(1) = \psi_1$ if $\psi_1 > \psi_2$ and is $\psi_2$ otherwise, while
$\psi_{LRSE}(0) = \psi_1$ if $(1 - \psi_1) > (1 - \psi_2)$ and is $\psi_2$ otherwise. So we see from this
that $\psi_{MAP}$ will always classify a person to the nondiseased population when $\epsilon$
is small enough, e.g., take $\psi_1 = 0.05$, $\psi_2 = 0.80$, and $\epsilon < 0.0566$. By contrast,
in this situation, $\psi_{LRSE}$ will always classify an individual with a positive test
to the diseased population and to the nondiseased population for a negative
test. Now $M_{\psi_i}$ is the Binomial$(1, \psi_i)$ distribution, so when $\psi_1 < \psi_2$ and $\epsilon$ is small
enough

$$M_{\psi_1}(\psi_{MAP} \ne \psi_1) + M_{\psi_2}(\psi_{MAP} \ne \psi_2) = 0 + 1 = 1,$$

$$M_{\psi_1}(\psi_{LRSE} \ne \psi_1) + M_{\psi_2}(\psi_{LRSE} \ne \psi_2) = \psi_1 + (1 - \psi_2) < 1.$$

This illustrates clearly the difference between these two procedures as $\psi_{LRSE}$
does vastly better than $\psi_{MAP}$ on the diseased population when $\psi_1$ is small and
$\psi_2$ is large as would be the case for a good diagnostic. Of course $\psi_{MAP}$ minimizes
the overall error rate but at the price of ignoring the most important class in
this problem. Note that this example can be extended to the situation where
we need to estimate the $\psi_i$ based on samples from the respective populations
but this will not materially affect the overall conclusions. Also see Example 3
where $\epsilon$ is considered unknown.
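
The numbers in Example 1 can be reproduced with a few lines of code. The sketch below uses the values $\psi_1 = 0.05$ and $\psi_2 = 0.80$ from the text and an illustrative $\epsilon = 0.01$ (this particular $\epsilon$ is an assumption); the function names are hypothetical. It computes the two classifiers, the conditional error probabilities, and the value of $\epsilon$ below which the MAP classifier assigns a positive test to the nondiseased class, which from the condition $\psi_1/\psi_2 > \epsilon/(1-\epsilon)$ is $\psi_1/(\psi_1 + \psi_2)$.

```python
psi1, psi2 = 0.05, 0.80        # P(positive test) for nondiseased / diseased
eps = 0.01                     # illustrative prior probability of disease
prior = {1: 1.0 - eps, 2: eps} # class 1 = nondiseased, class 2 = diseased

def lik(label, x):
    """Probability of test result x (1 = positive, 0 = negative) in a class."""
    p = psi1 if label == 1 else psi2
    return p if x == 1 else 1.0 - p

def classify_map(x):
    # Compare unnormalized posterior probabilities.
    return 1 if prior[1] * lik(1, x) > prior[2] * lik(2, x) else 2

def classify_lrse(x):
    # Relative belief ratios are proportional to the likelihoods.
    return 1 if lik(1, x) > lik(2, x) else 2

def conditional_error(classifier, label):
    """Probability of misclassification given the true class."""
    return sum(lik(label, x) for x in (0, 1) if classifier(x) != label)

map_sum = conditional_error(classify_map, 1) + conditional_error(classify_map, 2)
lrse_sum = conditional_error(classify_lrse, 1) + conditional_error(classify_lrse, 2)

# Overall (prior-weighted) error rates, for comparison.
map_overall = sum(prior[c] * conditional_error(classify_map, c) for c in (1, 2))
lrse_overall = sum(prior[c] * conditional_error(classify_lrse, c) for c in (1, 2))

eps_threshold = psi1 / (psi1 + psi2)   # MAP ignores a positive test below this

print("sum of conditional errors, MAP :", map_sum)
print("sum of conditional errors, LRSE:", lrse_sum)
print("overall error, MAP / LRSE      :", map_overall, lrse_overall)
print("epsilon threshold              :", eps_threshold)
```

The computed threshold is about 0.0588, so any $\epsilon$ below the value 0.0566 quoted in the example falls inside the region where MAP classifies everyone as nondiseased. The overall error rate is smaller for MAP, as the text notes, while the sum of conditional errors is much smaller for the LRSE.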

In a general estimation problem an estimator $\delta$ is unbiased with respect to
a loss function $L$ if $E_\theta(L(\theta', \delta(x))) \ge E_\theta(L(\theta, \delta(x)))$ for all $\theta', \theta \in \Theta$. This says
that on average $\delta(x)$ is closer to the true value than any other value when we
interpret $L(\theta, \delta(x))$ as a measure of distance between the estimate and what is
being estimated. A reasonable definition of Bayesian unbiasedness for $\delta$ with
respect to $L$ is thus obtained by requiring that

$$\int_\Theta \int_\Theta E_\theta(L(\theta', \delta(x)))\,\Pi(d\theta)\,\Pi(d\theta') \ge \int_\Theta E_\theta(L(\theta, \delta(x)))\,\Pi(d\theta) = r(\delta).$$

Here we are thinking of $\theta'$ as a false value generated from the prior independently
of the true value $\theta$ so $\theta'$ has no connection with the data. Therefore, $\delta$ is Bayesian
unbiased if on average $\delta(x)$ is closer to the true value than a false value. In
Section 3 we prove that $\psi_{LRSE}$ is Bayesian unbiased with respect to a general
class of loss functions that includes both (6) and $I(\Psi(\theta) \ne \psi)$.

3 Estimation from Prior-based Loss Functions: 
The Countably Infinite Case 

The loss function (6) does not provide meaningful results when $\Psi$ is infinite as
(8) shows that $r(\delta)$ will be infinite. So we modify (6) via a parameter $\eta > 0$
and define the loss function

$$L_\eta(\theta, \psi) = \frac{I(\Psi(\theta) \ne \psi)}{\max(\eta, \pi_\Psi(\Psi(\theta)))} \qquad (10)$$

and note that $L_\eta$ is a bounded function of $(\theta, \psi)$. This loss function is like (6)
but does not allow for arbitrarily large losses. Without loss of generality we can
restrict $\eta$ to a sequence of values converging to 0. We prove the following result
in the Appendix.

Theorem 2. Suppose that $\pi_\Psi(\psi) > 0$ for every $\psi \in \Psi$, that $\Psi$ is countable
with $\nu_\Psi$ equal to counting measure and that $\psi_{LRSE}(x)$ is the unique maximizer
of $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ for all $x$. For the loss function (10) and Bayes rule $\delta_\eta$, then
$\delta_\eta(x) \to \psi_{LRSE}(x)$ as $\eta \to 0$, for every $x \in \mathcal{X}$.

The proof of Theorem 2 also establishes the following result.

Corollary 3. For all sufficiently small $\eta$ the value of the Bayes rule at $x$ is
given by $\psi_{LRSE}(x)$.

If instead we take the loss function to be $I(\Psi(\theta) \ne \psi)$, then virtually the same
proof as in Theorem 1 establishes that $\psi_{MAP}$ is a Bayes rule.
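
The effect of the truncation parameter $\eta$ can be illustrated on a small finite example, which is enough to show the mechanism even though the theorem concerns countable spaces. The prior and likelihood values below are hypothetical. For a loss of the form $I(\Psi(\theta) \ne \psi)/\max(\eta, \pi_\Psi(\Psi(\theta)))$, the sketch computes the posterior risk of each action and returns the minimizer for several values of $\eta$.

```python
prior = [0.7, 0.2, 0.1]   # hypothetical prior over three values
lik = [0.2, 0.5, 0.9]     # hypothetical likelihood at the observed x

m = sum(p * l for p, l in zip(prior, lik))
post = [p * l / m for p, l in zip(prior, lik)]

def posterior_risk(action, eta):
    """Posterior expected loss of an action under the eta-truncated loss."""
    return sum(post[j] / max(eta, prior[j])
               for j in range(len(prior)) if j != action)

def bayes_rule(eta):
    """Action with the smallest posterior risk for the given eta."""
    return min(range(len(prior)), key=lambda a: posterior_risk(a, eta))

for eta in (1.0, 0.3, 0.05, 0.01):
    print("eta =", eta, "-> Bayes action", bayes_rule(eta))
```

With $\eta$ at least as large as every prior probability the loss is a constant multiple of the 0-1 loss and the Bayes action is the posterior mode. Once $\eta$ drops below the smallest prior probability the Bayes action is the maximizer of the relative belief ratio and no longer changes, which is the behaviour described in Corollary 3.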

We now investigate the unbiasedness of $\psi_{LRSE}(x)$. For this we consider loss
functions of the form

$$L(\theta, \psi) = I(\Psi(\theta) \ne \psi)\,h(\Psi(\theta)) \qquad (11)$$

for some nonnegative function $h$ which satisfies $\int_\Theta h(\Psi(\theta))\,\Pi(d\theta) < \infty$. This
class of loss functions includes (6) when $\Psi$ is finite, (10) and $I(\Psi(\theta) \ne \psi)$. We
have the following result.

Theorem 4. If $\Psi$ is countable, then $\psi_{LRSE}(x)$ is Bayesian unbiased under the
loss function (11).

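Proof: The prior risk of $\delta$ is given by

$$r(\delta) = \int_\Theta \int_{\mathcal{X}} L(\theta, \delta(x))\,P_\theta(dx)\,\Pi(d\theta)$$
$$= \int_\Theta \int_{\mathcal{X}} [h(\Psi(\theta)) - I(\Psi(\theta) = \delta(x))\,h(\Psi(\theta))]\,P_\theta(dx)\,\Pi(d\theta)$$
$$= \int_\Theta h(\Psi(\theta))\,\Pi(d\theta) - \int_{\mathcal{X}} \int_\Theta I(\Psi(\theta) = \delta(x))\,h(\Psi(\theta))\,\Pi(d\theta \mid x)\,M(dx)$$
$$= \int_\Theta h(\Psi(\theta))\,\Pi(d\theta) - \int_{\mathcal{X}} h(\delta(x))\,\pi_\Psi(\delta(x) \mid x)\,M(dx)$$

and

$$\int_\Theta \int_\Theta \int_{\mathcal{X}} L(\theta', \delta(x))\,P_\theta(dx)\,\Pi(d\theta)\,\Pi(d\theta')$$
$$= \int_\Theta \int_\Theta \int_{\mathcal{X}} [h(\Psi(\theta')) - I(\Psi(\theta') = \delta(x))\,h(\Psi(\theta'))]\,P_\theta(dx)\,\Pi(d\theta)\,\Pi(d\theta')$$
$$= \int_\Theta h(\Psi(\theta))\,\Pi(d\theta) - \int_{\mathcal{X}} h(\delta(x))\,\pi_\Psi(\delta(x))\,M(dx).$$

Therefore, $\delta$ is Bayesian unbiased if and only if

$$\int_{\mathcal{X}} h(\delta(x))\,[\pi_\Psi(\delta(x) \mid x) - \pi_\Psi(\delta(x))]\,M(dx) \ge 0. \qquad (12)$$

It is a consequence of results proved in Evans and Shakhatreh (2008) that it is
always true that $\pi_\Psi(\psi_{LRSE}(x) \mid x)/\pi_\Psi(\psi_{LRSE}(x)) \ge 1$ and this establishes the
result. This can also be seen by noting that $\pi_\Psi(\cdot \mid x)/\pi_\Psi(\cdot)$ is the density of
$\Pi_\Psi(\cdot \mid x)$ with respect to $\Pi_\Psi$, and so we must have that the maximum of this
density is greater than or equal to 1.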

The proof gives a sufficient condition for Bayesian unbiasedness with respect to
the loss (11).

Corollary 5. $\delta$ is Bayesian unbiased if $\pi_\Psi(\delta(x) \mid x) \ge \pi_\Psi(\delta(x))$ for all $x$.

At this point we have neither a proof of the Bayesian unbiasedness of $\psi_{MAP}$
with respect to $I(\Psi(\theta) \ne \psi)$, nor a counterexample although we suspect that
it is not. We do know, however, that $\psi_{MAP}$ is Bayesian unbiased with respect
to $I(\Psi(\theta) \ne \psi)$ whenever $\Pi_\Psi$ is uniform because in that case $\psi_{MAP} = \psi_{LRSE}$.
It is also clear from (12) that $\psi_{LRSE}$ possesses a very strong property as the
integrand is always nonnegative when $\delta = \psi_{LRSE}$. In light of this we refer to an
estimator possessing this property as being uniformly (in $x$) Bayesian unbiased.
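
The uniform nonnegativity of the integrand in (12) can be checked in a toy model. The sketch below assumes a hypothetical model with three parameter values and three possible data values, takes $\Psi(\theta) = \theta$ and $h \equiv 1$, and evaluates each term $m(x)[\pi_\Psi(\delta(x) \mid x) - \pi_\Psi(\delta(x))]$ for $\delta = \psi_{LRSE}$. It also compares the risk under the true value with the risk under an independent false value drawn from the prior.

```python
prior = [0.6, 0.3, 0.1]                 # hypothetical prior
lik = [[0.7, 0.2, 0.1],                 # lik[theta][x], each row sums to 1
       [0.2, 0.6, 0.2],
       [0.1, 0.3, 0.6]]
K = len(prior)

# Prior predictive m(x) and posterior post[x][theta].
m = [sum(prior[t] * lik[t][x] for t in range(K)) for x in range(K)]
post = [[prior[t] * lik[t][x] / m[x] for t in range(K)] for x in range(K)]

def lrse(x):
    """Maximizer of the relative belief ratio at data value x."""
    return max(range(K), key=lambda t: post[x][t] / prior[t])

# Terms of the sum in (12) with h identically 1.
terms = [m[x] * (post[x][lrse(x)] - prior[lrse(x)]) for x in range(K)]

# Prior risk under 0-1 loss, and the risk when the "true" value is replaced
# by an independent draw from the prior.
true_risk = 1.0 - sum(m[x] * post[x][lrse(x)] for x in range(K))
false_risk = 1.0 - sum(m[x] * prior[lrse(x)] for x in range(K))

print("terms of (12):", terms)
print("risk at true value :", true_risk)
print("risk at false value:", false_risk)
```

Every term is nonnegative in this example, and the difference between the two risks equals the sum of the terms, in line with the equivalence stated in the proof of Theorem 4.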

4 Estimation from Prior-based Loss Functions: 
The Continuous Case 

When $\psi$ has a continuous prior distribution the argument in Theorem 2 does
not work as $\Pi_\Psi(\{\delta(x)\} \mid x) = 0$. There are several possible ways to proceed
here but we consider a discretization of the problem that uses Theorem 2. For 
this we will assume that the spaces involved are locally Euclidean, mappings 
are sufficiently smooth and take the support measures to be the analogs of 
Euclidean volume on the respective spaces. Further details on the mathematical 
requirements underlying these assumptions can be found in Tjur (1974) where 
spaces are taken to be Riemann manifolds. While the argument we provide 
applies quite generally, we simplify this here by taking all spaces to be open 
subsets of Euclidean spaces and the support measures to be Euclidean volume 
on these sets. 

For each $\lambda > 0$ we discretize the set $\Psi$ via a countable partition
$\{B_\lambda(\psi) : \psi \in \Psi\}$ where $\psi \in B_\lambda(\psi)$, $\Pi_\Psi(B_\lambda(\psi)) > 0$, and $\sup_{\psi \in \Psi} \mathrm{diam}(B_\lambda(\psi)) \to 0$ as $\lambda \to 0$.
For example, the $B_\lambda(\psi)$ could be equal volume rectangles in $R^k$. Further, we
assume that $\Pi_\Psi(B_\lambda(\psi))/\nu_\Psi(B_\lambda(\psi)) \to \pi_\Psi(\psi)$ as $\lambda \to 0$ for every $\psi$. This will
hold whenever $\pi_\Psi$ is continuous everywhere and $B_\lambda(\psi)$ converges nicely to $\{\psi\}$
as $\lambda \to 0$ (see Rudin (1974), Chapter 8 for the definition of 'converges nicely').
Let $\psi_\lambda(\psi) \in B_\lambda(\psi)$ be such that $\psi_\lambda(\psi') = \psi_\lambda(\psi)$ whenever $\psi' \in B_\lambda(\psi)$ and
$\Psi_\lambda = \{\psi_\lambda(\psi) : \psi \in \Psi\}$ be the discretized version of $\Psi$. Note that one
point is chosen in each $B_\lambda(\psi)$. We will call this a regular discretization of $\Psi$.
The discretized prior on $\Psi_\lambda$ is $\pi_{\Psi,\lambda}(\psi_\lambda(\psi)) = \Pi_\Psi(B_\lambda(\psi))$ and the discretized
posterior is $\pi_{\Psi,\lambda}(\psi_\lambda(\psi) \mid x) = \Pi_\Psi(B_\lambda(\psi) \mid x)$.

We define the loss function for the discretized problem just as for Theorem
2, by

$$L_{\lambda,\eta}(\theta, \psi_\lambda(\psi)) = \frac{I(\psi_\lambda(\Psi(\theta)) \ne \psi_\lambda(\psi))}{\max(\eta, \pi_{\Psi,\lambda}(\psi_\lambda(\Psi(\theta))))} \qquad (13)$$

and denote a Bayes rule for this problem by $\delta_{\lambda,\eta}(x)$. In this case we not only
need that $\psi_{LRSE}(x)$ is the unique maximizer of $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$, but we cannot
allow $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ to come arbitrarily close to its maximum outside a
neighborhood of $\psi_{LRSE}(x)$. It is clear that when this does not hold then we are in a
pathological situation that will not apply in a typical application. The following
result is proved in the Appendix.

Theorem 6. Suppose that $\pi_\Psi$ is positive and continuous and we have a regular
discretization of $\Psi$. Further suppose that $\psi_{LRSE}(x)$ is the unique maximizer of
$\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ and for any $\epsilon > 0$

$$\sup_{\{\psi : \|\psi - \psi_{LRSE}(x)\| \ge \epsilon\}} \frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)} < \frac{\pi_\Psi(\psi_{LRSE}(x) \mid x)}{\pi_\Psi(\psi_{LRSE}(x))}.$$

Then, there exists $\eta(\lambda) > 0$ such that a Bayes rule $\delta_{\lambda,\eta(\lambda)}(x)$ converges to
$\psi_{LRSE}(x)$ as $\lambda \to 0$ for all $x$.

Theorem 6 says that $\psi_{LRSE}$ is a limit of Bayes rules. So when $\Psi(\theta) = \theta$ we
have the result that the MLE is a limit of Bayes rules and more generally the
maximum integrated likelihood estimator is a limit of Bayes rules.

Now let $\psi_\lambda(x)$ be the LRSE of the discretized problem, i.e., $\psi_\lambda(x)$ maximizes
$\Pi_\Psi(B_\lambda(\psi) \mid x)/\Pi_\Psi(B_\lambda(\psi))$ as a function of $\psi \in \Psi_\lambda$. The following result is
proved in the Appendix.

Corollary 7. $\psi_\lambda$ converges to $\psi_{LRSE}$ as $\lambda \to 0$.

Note that by Theorem 4, $\psi_\lambda$ is uniformly Bayesian unbiased for the discretized
problem. Therefore, $\psi_{LRSE}$ is the limit of uniformly Bayesian unbiased
estimators.

By similar arguments we can establish an analog of Theorem 6 for $\psi_{MAP}$
using the loss function given by (5). Actually in this case a simpler
development can be followed in certain situations. For this note that the posterior risk
of $\delta$ is given by $1 - \Pi_\Psi(B_\lambda(\delta(x)) \mid x) = 1 - \pi_\Psi(\delta'(x) \mid x)\,\nu_\Psi(B_\lambda(\delta(x)))$ for some
$\delta'(x) \in B_\lambda(\delta(x))$. Now suppose we take $B_\lambda(\psi)$ to be a sphere of radius $\lambda$ centered
at $\psi$. Suppose further that for each $\epsilon > 0$ there exists a $\lambda(\epsilon) > 0$ such that when
$\|\psi - \psi_{MAP}(x)\| > \lambda(\epsilon)$ then $\pi_\Psi(\psi \mid x) < \inf_{\psi' \in B_{\lambda(\epsilon)}(\psi_{MAP}(x))} \pi_\Psi(\psi' \mid x)$. Since
$\nu_\Psi(B_\lambda(\psi))$ is constant we have that a Bayes rule $\delta_\lambda$ must then satisfy
$\|\delta_\lambda(x) - \psi_{MAP}(x)\| < \epsilon$. So we have proved that $\psi_{MAP}$ is a limit of Bayes rules. By
contrast, for the loss function $I(\Psi(\theta) \notin B_\lambda(\psi))/\Pi_\Psi(B_\lambda(\Psi(\theta)))$ the posterior risk of
$\delta$ is given by
$\int_\Psi \{\Pi_\Psi(B_\lambda(\psi))\}^{-1}\,\Pi_\Psi(d\psi \mid x) - \int_{B_\lambda(\delta(x))} \{\Pi_\Psi(B_\lambda(\psi))\}^{-1}\,\Pi_\Psi(d\psi \mid x)$.
The simpler approach is not available in this case because the first term is
unbounded.

We consider now an important example. 

Example 2. Regression (estimation) 

Suppose that we have $y = X\beta + e$ where $y \in R^n$, $X \in R^{n \times k}$ is fixed,
$\beta \in R^k$, and $e \sim N_n(0, \sigma^2 I)$. We will assume that $\sigma^2$ is known to simplify
the discussion. Let $\pi$ be a prior density for $\beta$. Then having observed $(X, y)$,
$\beta_{LRSE}(y) = b = (X'X)^{-1}X'y$ which is the MLE of $\beta$. It is interesting to contrast
this result with what might be considered more standard Bayesian estimates
such as the posterior mode or posterior mean. For example, suppose that
$\beta \sim N_k(0, \tau^2 I)$. Then the posterior distribution of $\beta$ is $N_k(\mu_{post}(\beta), \Sigma_{post}(\beta))$ where

$$\mu_{post}(\beta) = \Sigma_{post}(\beta)\,\sigma^{-2}X'Xb, \qquad \Sigma_{post}(\beta) = (\tau^{-2}I + \sigma^{-2}X'X)^{-1}$$

and the posterior mean and modal estimates of $\beta$ are both equal to $\mu_{post}(\beta)$.
Writing the spectral decomposition of $X'X$ as $X'X = Q\Lambda Q'$ we have that

$$\|\mu_{post}(\beta)\| = \|(I + (\sigma^2/\tau^2)\Lambda^{-1})^{-1}Q'b\|.$$

Since $\|b\| = \|Q'b\|$ and $1/(1 + \sigma^2/(\tau^2\lambda_i)) < 1$ for each $i$, we see that $\mu_{post}(\beta)$
moves the MLE towards the prior mean 0. This is often cited as a positive
attribute of these estimates but consider the situation where the true value of $\beta$
lies in the tails of the prior. In that case it is certainly wrong to move $\beta$ towards
the prior mean. When $\tau^2$ is chosen very large, so we avoid the possibility that
the true value of $\beta$ lies in the tails of the prior, then the MLE and the posterior
mean are virtually the same. It makes sense to choose $\tau^2 > \sigma^2$ as this says we
have less prior information about a $\beta_j$ than the amount we learn about $\beta$ from
a single observation. So it is not clear that shrinking the MLE is necessarily a
good thing particularly as this requires giving up invariance.

Suppose now we want to estimate $\psi = w'\beta$ for some setting $w$ of the
predictors. The prior distribution of $\psi$ is $N(0, \sigma^2_{prior}(\psi)) = N(0, \tau^2 w'w)$ and the
posterior distribution is $N(\mu_{post}(\psi), \sigma^2_{post}(\psi)) = N(w'\mu_{post}(\beta), w'\Sigma_{post}(\beta)w)$.
Note that $\sigma^2_{prior}(\psi) - \sigma^2_{post}(\psi) = w'(\tau^2 I - \Sigma_{post}(\beta))w = \tau^2 w'Q(I - (I + (\tau^2/\sigma^2)\Lambda)^{-1})Q'w > 0$
and so maximizing the ratio of the posterior to prior
densities leads to

$$\psi_{LRSE}(y) = (1 - \sigma^2_{post}(\psi)/\sigma^2_{prior}(\psi))^{-1}\mu_{post}(\psi). \qquad (14)$$

Since $\sigma^2_{prior}(\psi) > \sigma^2_{post}(\psi)$ we have $|\psi_{LRSE}(y)| > |\mu_{post}(\psi)|$ and $\mu_{post}(\psi) = \psi_{MAP}(y)$.
Note that when $\sigma^2_{post}(\psi)$ is much smaller than $\sigma^2_{prior}(\psi)$, in other
words the posterior is densely concentrated about $\mu_{post}(\psi)$, then $\psi_{LRSE}(y)$ and
$\psi_{MAP}(y)$ are very similar. In general $\psi_{LRSE}(y)$ is not equal to $w'b$, the plug-in
MLE of $\psi$, although $\psi_{LRSE}(y) \to w'b$ as $\tau^2 \to \infty$.
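
The formula for the LRSE of a linear combination can be checked numerically. The sketch below is an illustration under assumed inputs: $k = 2$, $\sigma^2 = 1$, $\tau^2 = 4$, $w = (1, 1)'$, and made-up summary statistics $X'X$ and $X'y$. The helper functions are hypothetical and written out by hand to keep the example dependency-free. It computes $\mu_{post}(\psi)$, the two variances, the value given by (14), and the plug-in MLE $w'b$.

```python
sigma2 = 1.0
XtX = [[10.0, 3.0], [3.0, 8.0]]   # hypothetical X'X
Xty = [12.0, 5.0]                 # hypothetical X'y
w = [1.0, 1.0]                    # predictor setting of interest

def inv2(a):
    """Inverse of a 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def matvec(a, v):
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def summaries(tau2):
    """Posterior mean and variances of psi = w'beta, and the LRSE from (14)."""
    precision = [[XtX[i][j] / sigma2 + (1.0 / tau2 if i == j else 0.0)
                  for j in range(2)] for i in range(2)]
    Sigma_post = inv2(precision)
    mu_post = matvec(Sigma_post, [v / sigma2 for v in Xty])  # uses X'Xb = X'y
    mu_psi = dot(w, mu_post)
    var_post = dot(w, matvec(Sigma_post, w))
    var_prior = tau2 * dot(w, w)
    psi_lrse = mu_psi / (1.0 - var_post / var_prior)
    return mu_psi, var_post, var_prior, psi_lrse

b = matvec(inv2(XtX), Xty)        # MLE of beta
w_b = dot(w, b)                   # plug-in MLE of psi

mu_psi, var_post, var_prior, psi_lrse = summaries(4.0)

# Derivative of log(posterior density / prior density) at the LRSE.
score = -(psi_lrse - mu_psi) / var_post + psi_lrse / var_prior

print("posterior mean (MAP):", mu_psi)
print("LRSE from (14)      :", psi_lrse)
print("plug-in MLE w'b     :", w_b)
```

With these numbers the LRSE lies further from the prior mean than the posterior mean and is close to, but not equal to, $w'b$. Calling `summaries` with a very large `tau2` gives an LRSE that is numerically indistinguishable from $w'b$.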



5 Prediction from Prior-based Loss Functions 

Suppose after observing $x$ we want to predict a future value $y \in \mathcal{Y}$ where
$y$ has model given by $g_{v(\theta)}(y \mid x)$ with respect to support measure $\mu_{\mathcal{Y}}$ on $\mathcal{Y}$.
We allow for the possibility here that the distribution of $y$ depends on $x$ and
also that $\theta$ may not index these distributions. Then we have that the joint
density of $(\theta, x, y)$ is given by $\pi(\theta) f_\theta(x) g_{v(\theta)}(y \mid x)$ and after observing $x$ the
conditional density of $y$ is given by the posterior predictive density
$q(y \mid x) = \int_\Theta \pi(\theta \mid x)\,g_{v(\theta)}(y \mid x)\,\nu(d\theta)$ while the prior predictive density of $y$ is given by
$q(y) = \int_\Theta \int_{\mathcal{X}} \pi(\theta) f_\theta(x)\,g_{v(\theta)}(y \mid x)\,\mu(dx)\,\nu(d\theta)$. Therefore, the relative belief in a
future value $y$ is given by $q(y \mid x)/q(y)$ and we denote the maximizer of this by
$y_{LRSE}(x)$.

Again the LRSE arises from loss function considerations. For example, when
$\mathcal{Y}$ is finite we consider the loss function

$$L(y, \delta(x)) = \frac{I(\delta(x) \ne y)}{q(y)}$$

where we think of $y$ as some true value of $y$ that is concealed from us by the
future, or some other mechanism, and which we want to predict. Then the
posterior risk of a predictor $\delta : \mathcal{X} \to \mathcal{Y}$ is given by

$$r(\delta \mid x) = \sum_{y \in \mathcal{Y}} \frac{q(y \mid x)}{q(y)} - \frac{q(\delta(x) \mid x)}{q(\delta(x))}$$

and we see that $y_{LRSE}$ is a Bayes rule. Also, the prior risk of predictor $\delta$ is given
by $r(\delta) = \sum_{y} M_y(\delta(x) \ne y)$ where $M_y$ is the conditional prior predictive of $x$
given $y$ and so $r(\delta)$ is the sum of the conditional prediction errors given $y$. We
can also develop results similar to Theorems 2 and 6 for the situation where $\mathcal{Y}$
is not finite to show that $y_{LRSE}$ is a limit of Bayes rules.
We consider some examples. 

Example 3. Classification (prediction) 

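Consider now a situation where $(x, c)$ is such that $x \mid c \sim f_c$ with
$c \sim$ Bernoulli$(\epsilon)$ where $f_0$ and $f_1$ are known (or accurately estimated based on large
samples) but $\epsilon$ is unknown with prior $\pi$. This is a generalization of Example 1
where $\epsilon$ was assumed to be known. Then based on a sample $(x_1, c_1), \ldots, (x_n, c_n)$
from the joint distribution we want to predict the value $c_{n+1}$ for a newly observed
$x_{n+1}$. Therefore, $q(c) = \int (1 - \epsilon)^{1-c}\epsilon^{c}\pi(\epsilon)\,d\epsilon$ and, if $\epsilon \sim$ Beta$(\alpha, \beta)$, the prior
predictive of $c_{n+1}$ is Bernoulli$(\alpha/(\alpha + \beta))$. For $c_{n+1}$ the posterior predictive
density is
$q(c \mid (x_1, c_1), \ldots, (x_n, c_n), x_{n+1}) \propto (f_0(x_{n+1}))^{1-c}(f_1(x_{n+1}))^{c} \int \epsilon^{n\bar{c} + c}(1 - \epsilon)^{n(1-\bar{c}) + (1-c)}\pi(\epsilon)\,d\epsilon$
with $\bar{c} = n^{-1}\sum_{i=1}^{n} c_i$. With a Beta$(\alpha, \beta)$ prior for $\epsilon$, we have that
$q(c \mid (x_1, c_1), \ldots, (x_n, c_n), x_{n+1}) \propto f_c(x_{n+1})\,\Gamma(\alpha + n\bar{c} + c)\,\Gamma(\beta + n(1 - \bar{c}) + 1 - c)$. From this we see immediately
that

$$c_{MAP} = I\left(\frac{f_1(x_{n+1})}{f_0(x_{n+1})} > \frac{\beta + n(1 - \bar{c})}{\alpha + n\bar{c}}\right), \qquad c_{LRSE} = I\left(\frac{f_1(x_{n+1})}{f_0(x_{n+1})} > \frac{\alpha}{\beta} \cdot \frac{\beta + n(1 - \bar{c})}{\alpha + n\bar{c}}\right). \qquad (15)$$

Note that $c_{MAP}$ and $c_{LRSE}$ are identical whenever $\alpha = \beta$.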

We can see from these formulas that a substantial difference will arise
between $c_{MAP}$ and $c_{LRSE}$ when one of $\alpha$ or $\beta$ is much bigger than the other. As in
Example 1 these correspond to situations where we believe that $\epsilon$ or $1 - \epsilon$ is very
small. Suppose we take $\alpha = 1$ and let $\beta$ be relatively large, as this corresponds
to knowing a priori that $\epsilon$ is very small. Then (15) implies that $c_{MAP} \le c_{LRSE}$
and so $c_{LRSE} = 1$ whenever $c_{MAP} = 1$. A similar conclusion arises when we take
$\beta = 1$ and $\alpha < 1$.
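
The two predictors of Example 3 can be computed directly from the Gamma-function form of the posterior predictive. The sketch below is an illustration, not the authors' code: it assumes $f_0 = N(0, 1)$ and $f_1 = N(\mu, 1)$ as in the simulation described in this example, a Beta$(\alpha, \beta)$ prior, and hypothetical data summaries (`n_ones` labels equal to 1 among `n`). The function names are made up for the example.

```python
import math

def log_normal_pdf(x, mean):
    """Log density of N(mean, 1) at x."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)

def log_post_pred(c, x_new, n, n_ones, alpha, beta, mu):
    """Log posterior predictive of label c, up to a constant not depending on c."""
    log_f = log_normal_pdf(x_new, mu if c == 1 else 0.0)
    return (log_f
            + math.lgamma(alpha + n_ones + c)
            + math.lgamma(beta + (n - n_ones) + 1 - c))

def log_prior_pred(c, alpha, beta):
    """Log prior predictive: Bernoulli(alpha / (alpha + beta))."""
    p1 = alpha / (alpha + beta)
    return math.log(p1 if c == 1 else 1.0 - p1)

def c_map(x_new, n, n_ones, alpha, beta, mu):
    l1 = log_post_pred(1, x_new, n, n_ones, alpha, beta, mu)
    l0 = log_post_pred(0, x_new, n, n_ones, alpha, beta, mu)
    return 1 if l1 > l0 else 0

def c_lrse(x_new, n, n_ones, alpha, beta, mu):
    # Compare log relative belief ratios: posterior minus prior predictive.
    r1 = log_post_pred(1, x_new, n, n_ones, alpha, beta, mu) - log_prior_pred(1, alpha, beta)
    r0 = log_post_pred(0, x_new, n, n_ones, alpha, beta, mu) - log_prior_pred(0, alpha, beta)
    return 1 if r1 > r0 else 0

# Hypothetical data: n = 10 labelled cases, one with c = 1, new observation 1.5.
args = dict(x_new=1.5, n=10, n_ones=1, mu=1.0)
print("alpha=1, beta=14:", c_map(alpha=1.0, beta=14.0, **args),
      c_lrse(alpha=1.0, beta=14.0, **args))
print("alpha=1, beta=1 :", c_map(alpha=1.0, beta=1.0, **args),
      c_lrse(alpha=1.0, beta=1.0, **args))
```

With $\alpha = 1$ and $\beta = 14$ the MAP predictor assigns the new case to the large group while the LRSE assigns it to the small group; with $\alpha = \beta$ the two agree, as noted in the text.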

To see what kind of improvement is possible we consider a simulation. Here
we take $f_0$ to be a $N(0, 1)$ density, $f_1$ to be a $N(\mu, 1)$ density, let $n = 10$ and the
prior on $\epsilon$ be Beta$(1, \beta)$. Table 1 presents the Bayes risks for $c_{MAP}$ and $c_{LRSE}$
for various choices of $\beta$ when $\mu = 1$. When $\beta = 1$ they are equivalent but we
see that as $\beta$ rises the performance of $c_{MAP}$ deteriorates while $c_{LRSE}$ improves.
Large values of $\beta$ correspond to having information that $\epsilon$ is small. When $\beta = 14$
about 0.50 of the prior probability is to the left of 0.05, with $\beta = 32$ about 0.80
of the prior probability is to the left of 0.05, and with $\beta = 100$ about 0.99 of the
prior probability is to the left of 0.05. We see that the misclassification rates
for the small group ($c = 1$) stay about the same for $c_{LRSE}$ as $\beta$ increases while
they deteriorate markedly for $c_{MAP}$ as the MAP procedure basically ignores the
small group.

| $\beta$ | $M_0(c_{MAP} \ne 0) + M_1(c_{MAP} \ne 1)$ | $M_0(c_{LRSE} \ne 0) + M_1(c_{LRSE} \ne 1)$ |
|---|---|---|
| 1 | 0.386 + 0.390 = 0.776 | 0.386 + 0.390 = 0.776 |
| 14 | 0.002 + 0.975 = 0.977 | 0.285 + 0.380 = 0.665 |
| 32 | 0.000 + 0.997 = 0.997 | 0.292 + 0.349 = 0.641 |
| 100 | 0.000 + 1.000 = 1.000 | 0.300 + 0.324 = 0.624 |

Table 1: Conditional prior probabilities of misclassification for MAP and LRSE
for various values of $\beta$ in Example 3 when $\alpha = 1$, $\mu = 1$, and $n = 10$.

We also investigated other choices for $n$ and $\mu$. There is very little change
as $n$ increases. When $\mu$ moves towards 0 the error rates go up, and they go down as
$\mu$ moves away from 0, as one would expect. Of course, $c_{LRSE}$ always dominates
$c_{MAP}$.
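
A simulation of this kind can be sketched as follows. This is an illustrative Monte Carlo under stated assumptions, not the authors' code: $M_c$ is read as the probability obtained when $\epsilon$ is drawn from the Beta$(\alpha, \beta)$ prior, the labels $c_1, \ldots, c_n$ are drawn from Bernoulli$(\epsilon)$, and $x_{n+1}$ is drawn from $f_c$ with $c$ held fixed. The function name, the number of replications and the seed are hypothetical. Both rules are written as thresholds on the log-likelihood ratio $\log f_1(x)/f_0(x) = \mu x - \mu^2/2$.

```python
import random
from math import log


def misclassification_rates(beta, alpha=1.0, mu=1.0, n=10, reps=20000, seed=0):
    """Monte Carlo estimates of M_c(rule != c) for c = 0, 1 and both rules."""
    rng = random.Random(seed)
    rates = {}
    for c in (0, 1):
        errors = {"MAP": 0, "LRSE": 0}
        for _ in range(reps):
            eps = rng.betavariate(alpha, beta)                   # epsilon from the prior
            n_ones = sum(rng.random() < eps for _ in range(n))   # n * c-bar
            x_new = rng.gauss(mu * c, 1.0)                       # x_{n+1} ~ f_c
            log_lr = mu * x_new - 0.5 * mu ** 2                  # log f_1(x) / f_0(x)
            t_map = log((beta + n - n_ones) / (alpha + n_ones))  # MAP threshold
            t_lrse = t_map + log(alpha / beta)                   # LRSE threshold
            errors["MAP"] += int((log_lr > t_map) != bool(c))
            errors["LRSE"] += int((log_lr > t_lrse) != bool(c))
        for rule, count in errors.items():
            rates[(rule, c)] = count / reps
    return rates
```

Calling `misclassification_rates(32.0)` returns a dictionary keyed by `(rule, c)`. Changing `n` and `mu` allows the other settings mentioned above to be explored.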

Example 4. Regression (prediction) 

Consider the situation of Example 2 and suppose we want to predict a
response $z$ at the predictor value $w \in R^k$. When $\beta \sim N_k(0, \tau^2 I)$ the prior
distribution of $z$ is $z \sim N(0, \sigma^2 + \tau^2 w'w) = N(0, \sigma^2_{prior}(z))$ and the posterior
distribution is $N(\mu_{post}(z), \sigma^2_{post}(z))$ where

$$\mu_{post}(z) = w'\mu_{post}(\beta), \qquad \sigma^2_{post}(z) = \sigma^2 + w'\Sigma_{post}(\beta)w.$$

To obtain $z_{LRSE}(y)$ we need to maximize the ratio of the posterior to the prior
density of $z$ and an easy calculation shows that this leads to

$$z_{LRSE}(y) = (1 - \sigma^2_{post}(z)/\sigma^2_{prior}(z))^{-1}\mu_{post}(z). \qquad (16)$$

Note that $\sigma^2_{prior}(z) - \sigma^2_{post}(z) = \sigma^2_{prior}(w'\beta) - \sigma^2_{post}(w'\beta) > 0$ and so $|z_{LRSE}(y)|
> |\mu_{post}(z)|$ and the LRSE is further from the prior mean than $z_{MAP}(y) =
\mu_{post}(z)$. Also, we see that, when $\sigma^2_{post}(z)$ is small, $z_{LRSE}(y)$ and $z_{MAP}(y)$
are very similar. Finally, comparing (14) and (16) we have that

$$z_{LRSE}(y) = (\sigma^2_{prior}(z)/\sigma^2_{prior}(\psi))\,\psi_{LRSE}(y) = (1 + \sigma^2/(\tau^2 w'w))\,\psi_{LRSE}(y),$$

where $\psi = w'\beta$, and so the LRSE predictor at $w$ is more dispersed than the LRSE
estimator of the mean at $w$. This makes good sense as we have to take into account the
additional variation due to prediction. By contrast $z_{MAP}(y) = \psi_{MAP}(y)$.
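
The calculation behind (16) can be checked numerically. The sketch below is illustrative only: it takes the posterior mean and the posterior and prior variances of $z$ as given scalars, with prior mean 0 as in the example, and the function names are hypothetical. It evaluates the log ratio of the two normal densities so that the maximizer can be compared with nearby values.

```python
from math import log


def z_lrse(mu_post, var_post, var_prior):
    """Maximizer of the posterior/prior density ratio of z, as in (16)."""
    if var_post >= var_prior:
        raise ValueError("the ratio has a maximum only when var_post < var_prior")
    return mu_post / (1.0 - var_post / var_prior)


def log_density_ratio(z, mu_post, var_post, var_prior):
    """Log of N(mu_post, var_post) density over N(0, var_prior) density at z."""
    log_post = -0.5 * log(var_post) - 0.5 * (z - mu_post) ** 2 / var_post
    log_prior = -0.5 * log(var_prior) - 0.5 * z ** 2 / var_prior
    return log_post - log_prior  # the 2*pi constants cancel
```

With hypothetical values `mu_post = 2.0`, `var_post = 1.0` and `var_prior = 4.0`, `z_lrse` returns $2/(1 - 1/4) = 8/3$, which is further from the prior mean 0 than the MAP predictor 2.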

6 Regions from Prior-based Loss Functions 

We now consider the lowest posterior loss $\gamma$-credible regions that arise from the
prior-based loss functions we have considered. Let $C_\gamma(x)$ denote a $\gamma$-relative
surprise region for $\psi$. Consider first the case where $\Psi$ is finite. We have the
following result.

Theorem 8. Suppose that $\pi_\Psi(\psi) > 0$ for every $\psi \in \Psi$ and that $\Psi$ is finite with
$\nu_\Psi$ equal to counting measure. Then for the loss function given by ©, $C_\gamma(x)$
is a $\gamma$-lowest posterior loss credible region.

Proof: From (QJ and (7) the $\gamma$-lowest posterior loss credible region is

$$L_\gamma(x) = \{\psi : r(\psi \mid x) \le \ell_\gamma(x)\}$$

and $\ell_\gamma(x) = \inf\{k : \Pi_\Psi(\{\psi : r(\psi \mid x) \le k\} \mid x) \ge \gamma\}$. As $\int_\Psi (\pi_\Psi(z \mid x)/\pi_\Psi(z))\,\nu_\Psi(dz)$
is independent of $\psi$ it is clearly equivalent to define this region via $C_\gamma(x) =
\{\psi : \pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \ge c_\gamma(x)\}$, namely, $L_\gamma(x) = C_\gamma(x)$.
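
For a finite $\Psi$ the region $C_\gamma(x) = \{\psi : \pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \ge c_\gamma(x)\}$ can be computed by adding level sets of the ratio, from the largest ratio downwards, until the posterior content reaches $\gamma$. The sketch below is a minimal illustration with hypothetical names and a small numerical tolerance; the prior and posterior are supplied as dictionaries of probabilities.

```python
def relative_surprise_region(prior, posterior, gamma, tol=1e-12):
    """Return (C_gamma(x), c_gamma(x)) for finite Psi with counting measure."""
    ratios = {psi: posterior[psi] / prior[psi] for psi in prior}
    region, content, cutoff = set(), 0.0, None
    # Visit the distinct ratio values from largest to smallest, so that points
    # sharing a ratio value enter the region together.
    for cutoff in sorted(set(ratios.values()), reverse=True):
        level_set = [psi for psi, r in ratios.items() if r == cutoff]
        region.update(level_set)
        content += sum(posterior[psi] for psi in level_set)
        if content >= gamma - tol:
            break
    return region, cutoff
```

With the hypothetical values `prior = {"a": 0.7, "b": 0.2, "c": 0.1}` and `posterior = {"a": 0.5, "b": 0.3, "c": 0.2}`, the 0.4-relative surprise region is `{"b", "c"}`, whereas the corresponding hpd region would be `{"a"}`.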

Now consider the case where $\Psi$ is countable and we use loss function (10).
Following the proof of Theorem 8 we see that a $\gamma$-lowest posterior loss region
takes the form

$$L_{\eta,\gamma}(x) = \{\psi : \pi_\Psi(\psi \mid x)/\max(\eta, \pi_\Psi(\psi)) \ge \ell_{\eta,\gamma}(x)\}$$

where $\ell_{\eta,\gamma}(x) = \sup\{k : \Pi_\Psi(\{\psi : \pi_\Psi(\psi \mid x)/\max(\eta, \pi_\Psi(\psi)) \ge k\} \mid x) \ge \gamma\}$. We prove
the following result in the Appendix.

Theorem 9. Suppose that $\pi_\Psi(\psi) > 0$ for every $\psi \in \Psi$ and that $\Psi$ is countable
with $\nu_\Psi$ equal to counting measure. For the loss function (10), we have that
$C_\gamma(x) \subset \liminf_{\eta \to 0} L_{\eta,\gamma}(x)$ whenever $\gamma$ is such that $\Pi_\Psi(C_\gamma(x) \mid x) = \gamma$ and
$\limsup_{\eta \to 0} L_{\eta,\gamma}(x) \subset C_{\gamma'}(x)$ whenever $\gamma' > \gamma$ and $\Pi_\Psi(C_{\gamma'}(x) \mid x) = \gamma'$.

While Theorem 9 does not establish the exact convergence $\lim_{\eta \to 0} L_{\eta,\gamma}(x) =
C_\gamma(x)$, we suspect that this does hold under quite general circumstances
due to the discreteness. Theorem 9 does show that limit points of the
class of sets $L_{\eta,\gamma}(x)$ always contain $C_\gamma(x)$ and their posterior probability
content differs from $\gamma$ by at most $\gamma' - \gamma$ where $\gamma' > \gamma$ is the next largest value for
which we have exact content.

We now consider the continuous case and suppose we have a regular
discretization. For $S^* \subset \Psi_\lambda = \{\psi_\lambda(\psi) : \psi_\lambda(\psi) \in B_\lambda(\psi)\}$, namely, $S^*$ is a subset
of a discretized version of $\Psi$, we define the undiscretized version of $S^*$ to be
$S = \cup_{\psi \in S^*} B_\lambda(\psi)$. Now let $C^*_{\lambda,\gamma}(x)$ be the $\gamma$-relative surprise region for the
discretized problem and let $C_{\lambda,\gamma}(x)$ be its undiscretized version. Note that in a
continuous context we will consider two sets as equal if they differ only by a set
of measure 0 with respect to $\Pi_\Psi$. In the Appendix we prove the following, which
says that a $\gamma$-relative surprise region for the discretized problem (after
undiscretizing) converges to the $\gamma$-relative surprise region for the original problem.

Theorem 10. Suppose that $\pi_\Psi$ is positive and continuous, we have a regular
discretization of $\Psi$ and $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ has a continuous posterior distribution.
Then $\lim_{\lambda \to 0} C_{\lambda,\gamma}(x) = C_\gamma(x)$.

While Theorem 10 has interest in its own right, we can use it to prove that
relative surprise regions are limits of lowest posterior loss regions.

Let $L^*_{\eta,\lambda,\gamma}(x)$ be the $\gamma$-lowest posterior loss region obtained for the discretized
problem using loss function (13) and let $L_{\eta,\lambda,\gamma}(x)$ be the undiscretized version.
We prove the following result in the Appendix.

Theorem 11. Suppose that $\pi_\Psi$ is positive and continuous, we have a regular
discretization of $\Psi$ and $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ has a continuous posterior distribution.
Then

$$C_\gamma(x) = \lim_{\lambda \to 0}\liminf_{\eta \to 0} L_{\eta,\lambda,\gamma}(x) = \lim_{\lambda \to 0}\limsup_{\eta \to 0} L_{\eta,\lambda,\gamma}(x).$$
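
The role of $\eta$ in the lowest posterior loss regions can be illustrated in the discrete setting of Theorem 9, which is also the setting of each discretized problem in Theorem 11. The sketch below is illustrative only: it assumes a support truncated to finitely many points so that it can be enumerated, and the function name is hypothetical. It builds $L_{\eta,\gamma}(x)$ by thresholding the score $\pi_\Psi(\psi \mid x)/\max(\eta, \pi_\Psi(\psi))$.

```python
def lowest_loss_region(prior, posterior, gamma, eta, tol=1e-12):
    """Region {psi : posterior / max(eta, prior) >= l} with posterior content >= gamma."""
    score = {psi: posterior[psi] / max(eta, prior[psi]) for psi in prior}
    region, content = set(), 0.0
    # Add level sets of the score, largest first, until the content reaches gamma.
    for level in sorted(set(score.values()), reverse=True):
        members = [psi for psi, s in score.items() if s == level]
        region.update(members)
        content += sum(posterior[psi] for psi in members)
        if content >= gamma - tol:
            break
    return region


# Hypothetical prior and posterior on four points.
prior = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
posterior = {0: 0.4, 1: 0.1, 2: 0.2, 3: 0.3}

large_eta_region = lowest_loss_region(prior, posterior, gamma=0.5, eta=0.5)
small_eta_region = lowest_loss_region(prior, posterior, gamma=0.5, eta=0.01)
```

When $\eta$ is at least as large as every prior probability the score is proportional to the posterior, so the region is an hpd region (here `{0, 3}`). Once $\eta$ is no larger than the smallest prior probability the score is the relative belief ratio and the region is the relative surprise region (here `{2, 3}`).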

In Evans, Guttman, and Swartz (2006) and Evans and Shakhatreh (2008)
additional properties of relative surprise regions are developed. For example, it is
proved that a $\gamma$-relative surprise region $C_\gamma(x)$ for $\psi$ satisfying $\Pi_\Psi(C_\gamma(x) \mid x) = \gamma$
minimizes $\Pi_\Psi(B)$ among all (measurable) subsets of $\Psi$ satisfying $\Pi_\Psi(B \mid x) \ge \gamma$.
So a $\gamma$-relative surprise region is smallest among all $\gamma$-credible regions for $\psi$
where size is measured using the prior measure. This property has several
consequences. For example, the prior probability that a region $B(x) \subset \Psi$ contains
a false value from the prior is given by $\int_\Theta \int_\Psi P_\theta(\psi \in B(x))\,\Pi_\Psi(d\psi)\,\Pi(d\theta)$ where
a false value is a value of $\psi \sim \Pi_\Psi$ generated independently of $(\theta, x) \sim \Pi \times P_\theta$.
It can be proved that a $\gamma$-relative surprise region minimizes this probability
among all $\gamma$-credible regions for $\psi$ and is always unbiased in the sense that the
probability of covering a false value is bounded above by $\gamma$. Furthermore, a
$\gamma$-relative surprise region maximizes the relative belief ratio $\Pi_\Psi(B \mid x)/\Pi_\Psi(B)$ and
the Bayes factor $\Pi_\Psi(B \mid x)\Pi_\Psi(B^c)/\Pi_\Psi(B^c \mid x)\Pi_\Psi(B)$ among all regions $B \subset \Psi$
with $\Pi_\Psi(B \mid x) = \Pi_\Psi(C_\gamma(x) \mid x)$.
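
On a small finite example the first and last of these properties can be checked by brute force. The sketch below is only an illustration with hypothetical numbers, for which the 0.5-relative surprise region is `{"b", "c"}` with posterior content exactly 0.5. It enumerates every nonempty subset of $\Psi$ and compares prior contents and relative belief ratios.

```python
from itertools import combinations


def all_subsets(points):
    """Yield every nonempty subset of points as a frozenset."""
    for size in range(1, len(points) + 1):
        for subset in combinations(points, size):
            yield frozenset(subset)


def content(measure, region):
    """Probability content of a region under a discrete measure."""
    return sum(measure[psi] for psi in region)


prior = {"a": 0.7, "b": 0.2, "c": 0.1}
posterior = {"a": 0.5, "b": 0.3, "c": 0.2}
gamma = 0.5
c_gamma = frozenset({"b", "c"})  # ratios: a = 0.714..., b = 1.5, c = 2.0

# All gamma-credible regions, and the one with the smallest prior content.
credible = [B for B in all_subsets(sorted(prior)) if content(posterior, B) >= gamma - 1e-12]
smallest = min(credible, key=lambda B: content(prior, B))

# Among regions with the same posterior content, the largest relative belief ratio.
same_content = [B for B in credible if abs(content(posterior, B) - gamma) < 1e-12]
best_ratio = max(same_content, key=lambda B: content(posterior, B) / content(prior, B))
```

In this example both `smallest` and `best_ratio` coincide with `c_gamma`.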

While the results in this section have been concerned with obtaining credible 
regions for parameters, similar results can be proved for the construction of 
prediction regions. 

7 Conclusions 

Relative surprise inferences are closely related to likelihood inferences. This,
together with their invariance and optimality properties, makes them prime
candidates as appropriate inferences in Bayesian contexts. This paper has shown that
relative surprise inferences arise naturally in a decision-theoretic formulation
using loss functions based on the prior. As yet these inferences are not typically
used, while MAP-based inferences, which seem to possess few strong properties,
are commonly recommended. Based on the properties we have discussed in
this paper we conclude that improvements in inferences can be accomplished
by adopting relative surprise inferences. While we have required proper priors
in this paper, limiting relative surprise inferences, as priors become increasingly
diffuse, can also be obtained and have been discussed in the references.

Relative surprise estimation of the parameter $\psi$ is based on the relative
belief ratio $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$. As this ratio is independent of the choice of $\pi_\Psi$,
estimation of $\psi$ is to a certain extent robust to the choice of prior. The role of
the marginal prior $\pi_\Psi$ arises in quantifying the uncertainty about the estimate
of $\psi$ through the regions $C_\gamma$. So the conditional prior given $\psi$, together with
the model and data, are used to determine the form of any inferences about $\psi$,
while the marginal prior for $\psi$, together with the model and data, are used to
quantify the uncertainty in these inferences.
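
The robustness claim can be made concrete in a small discrete example. The relative belief ratio equals $m(x \mid \psi)/m(x)$, where $m(x \mid \psi)$ averages the likelihood over the conditional prior given $\psi$, so as a function of $\psi$ it depends on the marginal prior only through the constant $m(x)$. The sketch below is illustrative, with hypothetical names and numbers: `conditional[psi][nu]` is the conditional prior of a nuisance parameter given $\psi$ and `likelihood[psi][nu]` is the likelihood at the observed data.

```python
def relative_belief_ratios(marginal, conditional, likelihood):
    """Posterior-to-prior ratio of psi for a discrete parameter (psi, nu)."""
    weight = {
        psi: marginal[psi] * sum(conditional[psi][nu] * likelihood[psi][nu]
                                 for nu in conditional[psi])
        for psi in marginal
    }
    m_x = sum(weight.values())  # prior predictive at the observed data
    return {psi: (weight[psi] / m_x) / marginal[psi] for psi in marginal}


conditional = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}, 2: {0: 0.9, 1: 0.1}}
likelihood = {0: {0: 0.1, 1: 0.3}, 1: {0: 0.4, 1: 0.2}, 2: {0: 0.05, 1: 0.6}}
marginal_a = {0: 0.6, 1: 0.3, 2: 0.1}
marginal_b = {0: 0.1, 1: 0.2, 2: 0.7}

rb_a = relative_belief_ratios(marginal_a, conditional, likelihood)
rb_b = relative_belief_ratios(marginal_b, conditional, likelihood)
lrse_a = max(rb_a, key=rb_a.get)
lrse_b = max(rb_b, key=rb_b.get)
map_a = max(marginal_a, key=lambda psi: rb_a[psi] * marginal_a[psi])
map_b = max(marginal_b, key=lambda psi: rb_b[psi] * marginal_b[psi])
```

Here the two marginal priors give the same maximizer of the ratio ($\psi = 1$), while the posterior modes differ ($\psi = 0$ under the first marginal prior and $\psi = 2$ under the second).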

By contrast, predictions are based on the relative belief ratio $q(y \mid x)/q(y)$,
which is generally dependent on the full prior $\pi$. So in a sense predictions are
less robust to the prior than estimation. On the other hand, Bayesian inferences
are often advocated due to the regularizing effect of the prior. While the
relative surprise approach does not fully incorporate such an effect for parameter
estimates, the full effect is available for prediction.



Appendix 

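Proof of Theorem 2: We have that

$$r_\eta(\delta \mid x) = \int_\Psi \frac{\pi_\Psi(\psi \mid x)}{\max(\eta, \pi_\Psi(\psi))}\,\nu_\Psi(d\psi) - \frac{\pi_\Psi(\delta(x) \mid x)}{\max(\eta, \pi_\Psi(\delta(x)))}. \qquad (17)$$

The first term in (17) is constant in $\delta(x)$ and bounded above by $1/\eta$, so the
value of a Bayes rule at $x$ is obtained by finding $\delta(x)$ that maximizes the second
term.

Consider $\eta$ as fixed and note that

$$\frac{\pi_\Psi(\delta(x) \mid x)}{\max(\eta, \pi_\Psi(\delta(x)))} = \begin{cases} \pi_\Psi(\delta(x) \mid x)/\eta & \text{if } \eta > \pi_\Psi(\delta(x)) \\ \pi_\Psi(\delta(x) \mid x)/\pi_\Psi(\delta(x)) & \text{if } \eta \le \pi_\Psi(\delta(x)). \end{cases} \qquad (18)$$

There are at most finitely many values of $\psi$ satisfying $\eta \le \pi_\Psi(\psi)$ and so
$\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ assumes a maximum on this set, say at $\psi_\eta(x)$. There are
infinitely many values of $\psi$ satisfying $\eta > \pi_\Psi(\psi)$ but clearly we can find $\eta' < \eta$
so that $\{\psi : \eta' < \pi_\Psi(\psi) < \eta\}$ is nonempty and finite. Thus, $\pi_\Psi(\psi \mid x)$ assumes
its maximum on the set $\{\psi : \pi_\Psi(\psi) < \eta\}$ in the subset $\{\psi : \eta' < \pi_\Psi(\psi) < \eta\}$,
say at $\psi'_\eta(x)$. Therefore, a Bayes rule $\delta_\eta(x)$ is given by $\delta_\eta(x) = \psi_\eta(x)$ when
$\pi_\Psi(\psi_\eta(x) \mid x)/\pi_\Psi(\psi_\eta(x)) \ge \pi_\Psi(\psi'_\eta(x) \mid x)/\eta$ and $\delta_\eta(x) = \psi'_\eta(x)$ otherwise.
If $\eta > \pi_\Psi(\delta(x))$, then

$$\pi_\Psi(\delta(x) \mid x)/\eta < \pi_\Psi(\delta(x) \mid x)/\pi_\Psi(\delta(x)) \le \pi_\Psi(\psi_{LRSE}(x) \mid x)/\pi_\Psi(\psi_{LRSE}(x)).$$

Therefore, whenever $\eta \le \pi_\Psi(\psi_{LRSE}(x))$ the maximizer of (18) is given by $\delta(x) =
\psi_{LRSE}(x)$ and the result is proved.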
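
The conclusion of this proof can be illustrated numerically. The sketch below is a hedged illustration rather than part of the proof: it uses a hypothetical prior and posterior on a support truncated to five points, with hypothetical function names, and finds the Bayes rule by maximizing the second term of (17).

```python
def bayes_rule(prior, posterior, eta):
    """Maximizer of posterior(psi) / max(eta, prior(psi)); the first maximizer wins ties."""
    return max(prior, key=lambda psi: posterior[psi] / max(eta, prior[psi]))


def lrse(prior, posterior):
    """Maximizer of the relative belief ratio posterior(psi) / prior(psi)."""
    return max(prior, key=lambda psi: posterior[psi] / prior[psi])


prior = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.0625, 4: 0.0625}
posterior = {0: 0.35, 1: 0.25, 2: 0.20, 3: 0.15, 4: 0.05}

psi_lrse = lrse(prior, posterior)                       # ratio 2.4 at psi = 3
rule_small_eta = bayes_rule(prior, posterior, 0.0625)   # eta equal to the prior at psi_lrse
rule_mid_eta = bayes_rule(prior, posterior, 0.1)        # eta above the prior at psi_lrse
rule_large_eta = bayes_rule(prior, posterior, 0.5)      # eta at least every prior value
```

In this example the Bayes rule equals $\psi_{LRSE}(x) = 3$ once $\eta \le \pi_\Psi(\psi_{LRSE}(x)) = 0.0625$, moves to $\psi = 2$ for $\eta = 0.1$, and is the posterior mode $\psi = 0$ for $\eta = 0.5$.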

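Proof of Theorem 6: Just as in Theorem 2 a Bayes rule $\delta_{\lambda,\eta}(x)$ maximizes
$\pi_{\Psi,\lambda}(\delta(x) \mid x)/\max(\eta, \pi_{\Psi,\lambda}(\delta(x)))$ for $\delta(x) \in \Psi_\lambda$. Furthermore, as in Theorem
2, such a rule exists. Now define $\eta(\lambda)$ so that $0 < \eta(\lambda) < \Pi_\Psi(B_\lambda(\psi_{LRSE}(x)))$.
Note that $\eta(\lambda) \to 0$ as $\lambda \to 0$. We have that, as $\lambda \to 0$,

$$\frac{\pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x)) \mid x)}{\max(\eta(\lambda), \pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x))))} = \frac{\pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x)) \mid x)}{\pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x)))} \to \frac{\pi_\Psi(\psi_{LRSE}(x) \mid x)}{\pi_\Psi(\psi_{LRSE}(x))}. \qquad (19)$$

Let $\epsilon > 0$. Let $\lambda_0$ be such that $\sup_{\psi \in \Psi} \mathrm{diam}(B_\lambda(\psi)) < \epsilon/2$ for all $\lambda < \lambda_0$.
Then for $\lambda < \lambda_0$, and any $\delta(x)$ satisfying $\|\delta(x) - \psi_{LRSE}(x)\| > \epsilon$, we have

$$\frac{\pi_{\Psi,\lambda}(\psi_\lambda(\delta(x)) \mid x)}{\pi_{\Psi,\lambda}(\psi_\lambda(\delta(x)))} = \frac{\int_{B_\lambda(\psi_\lambda(\delta(x)))} \pi_\Psi(\psi \mid x)\,\nu_\Psi(d\psi)}{\int_{B_\lambda(\psi_\lambda(\delta(x)))} \pi_\Psi(\psi)\,\nu_\Psi(d\psi)} \le \sup_{\{\psi : \|\psi - \psi_{LRSE}(x)\| > \epsilon/2\}} \frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)} < \frac{\pi_\Psi(\psi_{LRSE}(x) \mid x)}{\pi_\Psi(\psi_{LRSE}(x))}. \qquad (20)$$

By (19) and (20) there exists $\lambda_1 < \lambda_0$ such that, for all $\lambda < \lambda_1$,

$$\frac{\pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x)) \mid x)}{\pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x)))} > \sup_{\{\psi : \|\psi - \psi_{LRSE}(x)\| > \epsilon/2\}} \frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)}. \qquad (21)$$

Therefore, when $\lambda < \lambda_1$, a Bayes rule $\delta_{\lambda,\eta(\lambda)}(x)$ satisfies

$$\frac{\pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x) \mid x)}{\pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x))} \ge \frac{\pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x) \mid x)}{\max(\eta(\lambda), \pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x)))} \ge \frac{\pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x)) \mid x)}{\max(\eta(\lambda), \pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x))))} = \frac{\pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x)) \mid x)}{\pi_{\Psi,\lambda}(\psi_\lambda(\psi_{LRSE}(x)))}. \qquad (22)$$

By (20), (21) and (22) this implies that $\|\delta_{\lambda,\eta(\lambda)}(x) - \psi_{LRSE}(x)\| \le \epsilon$ and the
convergence is established.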

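Proof of Corollary 7: Following the proof of Theorem 6 we have that
$\pi_{\Psi,\lambda}(\hat\psi_\lambda(x) \mid x)/\pi_{\Psi,\lambda}(\hat\psi_\lambda(x)) \ge \pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x) \mid x)/\pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x))$ and so by
(20), (21) and (22) this implies that $\|\hat\psi_\lambda(x) - \psi_{LRSE}(x)\| \le \epsilon$ and the
convergence of $\hat\psi_\lambda(x)$ to $\psi_{LRSE}(x)$ is established.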

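Proof of Theorem 9: For $c > 0$ let $S_c(x) = \{\psi : \pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \ge c\}$ and
$S_{\eta,c}(x) = \{\psi : \pi_\Psi(\psi \mid x)/\max(\eta, \pi_\Psi(\psi)) \ge c\}$. Note that $S_{\eta,c}(x) \uparrow S_c(x)$ as $\eta \to 0$.

Suppose $c$ is such that $\Pi_\Psi(S_c(x) \mid x) \le \gamma$. Then $\Pi_\Psi(S_{\eta,c}(x) \mid x) \le \gamma$ for all
$\eta$ and so $S_{\eta,c}(x) \subset L_{\eta,\gamma}(x)$. This implies that $S_c(x) \subset \liminf_{\eta \to 0} L_{\eta,\gamma}(x)$ and
since $\Pi_\Psi(C_\gamma(x) \mid x) = \gamma$ this implies that $C_\gamma(x) \subset \liminf_{\eta \to 0} L_{\eta,\gamma}(x)$.

Now suppose $c$ is such that $\Pi_\Psi(S_c(x) \mid x) > \gamma$. Then there exists $\eta_0$ such
that for all $\eta < \eta_0$ we have $\Pi_\Psi(S_{\eta,c}(x) \mid x) > \gamma$. Since $L_{\eta,\gamma}(x) \subset S_{\eta,c}(x)$ we have
that $\limsup_{\eta \to 0} L_{\eta,\gamma}(x) \subset S_c(x)$. Then choosing $c = c_{\gamma'}(x)$ for $\gamma' > \gamma$ implies
that $\limsup_{\eta \to 0} L_{\eta,\gamma}(x) \subset C_{\gamma'}(x)$.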

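Proof of Theorem 10: Let $S_c(x) = \{\psi : \pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \ge c\}$ and $S_{\lambda,c}(x) =
\{\psi : \Pi_\Psi(B_\lambda(\psi) \mid x)/\Pi_\Psi(B_\lambda(\psi)) \ge c\}$. Recall that

$$\lim_{\lambda \to 0} \Pi_\Psi(B_\lambda(\psi) \mid x)/\Pi_\Psi(B_\lambda(\psi)) = \pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$$

for every $\psi$. If $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) > c$, there exists $\lambda_0$ such that
for all $\lambda < \lambda_0$ we have $\Pi_\Psi(B_\lambda(\psi) \mid x)/\Pi_\Psi(B_\lambda(\psi)) > c$, and this implies that $\psi \in
\liminf_{\lambda \to 0} S_{\lambda,c}(x)$. Now $\Pi_\Psi(\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) = c) = 0$ and so we have $S_c(x) \subset
\liminf_{\lambda \to 0} S_{\lambda,c}(x)$ (after possibly deleting a set of $\Pi_\Psi$-measure 0 from $S_c(x)$).
Now, if $\psi \in \limsup_{\lambda \to 0} S_{\lambda,c}(x)$, then $\Pi_\Psi(B_\lambda(\psi) \mid x)/\Pi_\Psi(B_\lambda(\psi))
\ge c$ for infinitely many $\lambda \to 0$, which implies that $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \ge c$, and
therefore $\psi \in S_c(x)$. This proves $S_c(x) = \lim_{\lambda \to 0} S_{\lambda,c}(x)$ (up to a set of
$\Pi_\Psi$-measure 0) so that $\lim_{\lambda \to 0} \Pi_\Psi(S_{\lambda,c}(x) \Delta S_c(x) \mid x) = 0$ for any $c$.

Let $c_{\lambda,\gamma}(x) = \sup\{c \ge 0 : \Pi_\Psi(S_{\lambda,c}(x) \mid x) \ge \gamma\}$ so $S_{c_\gamma(x)}(x) = C_\gamma(x)$ and
$S_{\lambda,c_{\lambda,\gamma}(x)}(x) = C_{\lambda,\gamma}(x)$. Then we have that

$$\Pi_\Psi(C_\gamma(x) \Delta C_{\lambda,\gamma}(x) \mid x) = \Pi_\Psi(S_{c_\gamma(x)}(x) \Delta S_{\lambda,c_{\lambda,\gamma}(x)}(x) \mid x)$$
$$\le \Pi_\Psi(S_{c_\gamma(x)}(x) \Delta S_{\lambda,c_\gamma(x)}(x) \mid x) + \Pi_\Psi(S_{\lambda,c_\gamma(x)}(x) \Delta S_{\lambda,c_{\lambda,\gamma}(x)}(x) \mid x). \qquad (23)$$

Since $S_{c_\gamma(x)}(x) = \lim_{\lambda \to 0} S_{\lambda,c_\gamma(x)}(x)$ we have $\Pi_\Psi(S_{c_\gamma(x)}(x) \Delta S_{\lambda,c_\gamma(x)}(x) \mid x) \to 0$
and $\Pi_\Psi(S_{\lambda,c_\gamma(x)}(x) \mid x) \to \Pi_\Psi(S_{c_\gamma(x)}(x) \mid x) = \gamma$ as $\lambda \to 0$. Now consider the
second term in (23). Since $\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi)$ has a continuous posterior distribution,
we have that $\Pi_\Psi(\pi_\Psi(\psi \mid x)/\pi_\Psi(\psi) \ge c \mid x)$ is continuous in $c$. Let $\epsilon > 0$ and note that
for all $\lambda$ small enough, $\Pi_\Psi(S_{\lambda,c_{\gamma-\epsilon}(x)}(x) \mid x) < \gamma$ and $\Pi_\Psi(S_{\lambda,c_{\gamma+\epsilon}(x)}(x) \mid x) > \gamma$,
which implies that $c_{\gamma+\epsilon}(x) \le c_{\lambda,\gamma}(x) \le c_{\gamma-\epsilon}(x)$ and therefore $S_{\lambda,c_{\gamma-\epsilon}(x)}(x) \subset
S_{\lambda,c_{\lambda,\gamma}(x)}(x) \subset S_{\lambda,c_{\gamma+\epsilon}(x)}(x)$. As $S_{\lambda,c_{\lambda,\gamma}(x)}(x) \subset S_{\lambda,c_\gamma(x)}(x)$ or $S_{\lambda,c_{\lambda,\gamma}(x)}(x) \supset
S_{\lambda,c_\gamma(x)}(x)$, then

$$\Pi_\Psi(S_{\lambda,c_{\lambda,\gamma}(x)}(x) \Delta S_{\lambda,c_\gamma(x)}(x) \mid x) = |\Pi_\Psi(S_{\lambda,c_{\lambda,\gamma}(x)}(x) \mid x) - \Pi_\Psi(S_{\lambda,c_\gamma(x)}(x) \mid x)|.$$

For all $\lambda$ small, $\Pi_\Psi(S_{\lambda,c_{\lambda,\gamma}(x)}(x) \mid x) - \Pi_\Psi(S_{\lambda,c_\gamma(x)}(x) \mid x)$ is bounded above by

$$\max\{|\Pi_\Psi(S_{\lambda,c_{\gamma+\epsilon}(x)}(x) \mid x) - \Pi_\Psi(S_{\lambda,c_\gamma(x)}(x) \mid x)|,\ |\Pi_\Psi(S_{\lambda,c_{\gamma-\epsilon}(x)}(x) \mid x) - \Pi_\Psi(S_{\lambda,c_\gamma(x)}(x) \mid x)|\}$$

and this upper bound converges to $\epsilon$ as $\lambda \to 0$. Since $\epsilon$ is arbitrary we have that
the second term in (23) goes to 0 as $\lambda \to 0$ and this proves the result.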

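Proof of Theorem 11: Suppose, without loss of generality, that $0 < \gamma < 1$.
Let $\epsilon > 0$ and $\delta > 0$ satisfy $\gamma + \delta < 1$. Put $\gamma'(\lambda, \gamma) = \Pi_\Psi(C_{\lambda,\gamma}(x) \mid x)$, $\gamma''(\lambda, \gamma)
= \Pi_\Psi(C_{\lambda,\gamma+\delta}(x) \mid x)$ and note that $\gamma'(\lambda, \gamma) \ge \gamma$, $\gamma''(\lambda, \gamma) \ge \gamma + \delta$. By Theorem
10 we have that $C_{\lambda,\gamma}(x) \to C_\gamma(x)$ and $C_{\lambda,\gamma+\delta}(x) \to C_{\gamma+\delta}(x)$ as $\lambda \to 0$, so
$\gamma'(\lambda, \gamma) \to \gamma$ and $\gamma''(\lambda, \gamma) \to \gamma + \delta$ as $\lambda \to 0$. This implies that there is a $\lambda_0(\delta)$
such that for all $\lambda < \lambda_0(\delta)$ we have $\gamma'(\lambda, \gamma) < \gamma''(\lambda, \gamma)$. Therefore, by Theorem 9,
we have that for all $\lambda < \lambda_0(\delta)$

$$C_{\lambda,\gamma}(x) \subset \liminf_{\eta \to 0} L_{\eta,\lambda,\gamma'(\lambda,\gamma)}(x) \subset \limsup_{\eta \to 0} L_{\eta,\lambda,\gamma'(\lambda,\gamma)}(x) \subset C_{\lambda,\gamma+\delta}(x). \qquad (24)$$

From (24) and Theorem 10 we have that

$$C_\gamma(x) \subset \liminf_{\lambda \to 0}\liminf_{\eta \to 0} L_{\eta,\lambda,\gamma'(\lambda,\gamma)}(x) \subset \limsup_{\lambda \to 0}\limsup_{\eta \to 0} L_{\eta,\lambda,\gamma'(\lambda,\gamma)}(x) \subset C_{\gamma+\delta}(x).$$

Since $\lim_{\delta \to 0} C_{\gamma+\delta}(x) = C_\gamma(x)$ this establishes the result.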